Abstract. This paper presents an automatic dialogue scoring approach for a
Dialogue-Based Computer-Assisted Language Learning (DB-CALL) system,
which helps users learn a language through interactive conversations. The system
produces overall feedback based on the dialogue score to show learners which
areas need more attention. We present scoring measures, including
task proficiency, grammar accuracy, vocabulary knowledge, and syntactic ability,
that assess user performance during the dialogue. A user evaluation of
the automatic dialogue scoring results and the generated feedback collects
responses from real learners and examines whether the measures are helpful and
appropriate. Based on this evaluation, we also discuss how automatic dialogue
scoring differs from essay scoring.

1. Introduction


A DB-CALL system usually provides grammar correction feedback with a 
grammar checker, and discourse feedback via a semantic checker. We have 
developed GenieTutor (Kwon, Lee, Kim, & Lee, 2015a; Kwon et al., 2015b), 
which is a DB-CALL system for English learners in Korea. GenieTutor leads
dialogues by asking questions on different topics according to given scenarios,
and language learners answer the questions orally. The system recognises the speech,
evaluates whether the answers are semantically appropriate for the given questions,
checks for grammatical errors, and provides feedback (Lee, Kwon, Kim, & Lee, 2015). The
dialogues normally consist of two to four turns, and the system provides semantic
and grammar error feedback on each user utterance, deciding whether the dialogue can
move to the next turn. During development and user tests, we noticed that
users would like to know their overall score and level after finishing a whole
dialogue.

In this paper, we investigate dialogue scoring measures for the GenieTutor
system. The measures include task proficiency, grammar accuracy, utterance length
and complexity, and vocabulary level and diversity. Synonyms are also provided as
suggestions to improve user vocabulary.


2. Measures for automatic dialogue scoring 

Speech scoring has focused on restricted and highly predictable speech, mainly
evaluating speaking-related features, including pronunciation, intonation,
rhythm, and fluency (e.g. speaking rate, or the length and distribution of pauses).
For automated scoring of unrestricted spontaneous speech, more features related to
speaking content are adopted, including grammatical accuracy, syntactic complexity,
vocabulary diversity, and spoken discourse structure (Chen & Zechner, 2011; Xie,
Evanini, & Zechner, 2012). These content-related measures are similar to
those used in essay scoring, because both intend to assess communicative competence
(Attali & Burstein, 2004).

Our DB-CALL system has expected user answers for each system utterance, similar
to restricted speech scoring. However, because it is a dialogue system, some
factors of unrestricted spontaneous speech should also be considered in the scoring.

The first measure investigated for dialogue scoring is task proficiency,
which indicates how fluently the conversation has been maintained. It consists of the
task turn pass ratio and the user utterance pass ratio. The task turn pass ratio is the
ratio of passed turns to all task turns. For example, if three turns are predefined
for the scenario and the learner passes two turns but gives up in the final turn, then
the task turn pass ratio is 66.7%.

The user utterance pass ratio is the ratio of passed user utterances to all user
utterances. For example, suppose the learner finishes a dialogue with two task turns
using five utterances. Two of the five user utterances pass, so the user utterance
pass ratio is 40%, while the task turn pass ratio is 100%. Whether a user utterance
passes the task turn is determined by the semantic correctness module
(Kwon et al., 2015a).
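
The two ratios can be illustrated with a minimal Python sketch that reproduces the worked examples above. The function name `pass_ratio` and the handling of an empty dialogue are assumptions made for illustration, not part of GenieTutor.

```python
def pass_ratio(passed: int, total: int) -> float:
    """Return passed / total, or 0.0 when total is zero (assumed convention)."""
    if total < 0 or passed < 0 or passed > total:
        raise ValueError("expected 0 <= passed <= total")
    return passed / total if total else 0.0


# First example in the text: three predefined turns, two passed.
task_turn_ratio = pass_ratio(passed=2, total=3)

# Second example: two task turns, both passed, using five user utterances.
task_turn_ratio_2 = pass_ratio(passed=2, total=2)
utterance_ratio_2 = pass_ratio(passed=2, total=5)

print(f"{task_turn_ratio:.1%}")    # 66.7%
print(f"{task_turn_ratio_2:.0%}")  # 100%
print(f"{utterance_ratio_2:.0%}")  # 40%
```

How the two ratios are combined into a single proficiency value is not specified in the text, so the sketch leaves them separate.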

The second measure is grammar accuracy. Grammar checks are already performed
by the grammatical error correction modules in each turn (Lee et al., 2015).
Dialogue scoring only needs to compute the accuracy from the weighted
number of grammar errors, divided by the total number of words in all user
utterances. This measure is the same as in essay scoring (Attali & Burstein, 2004).
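
A minimal sketch of this computation follows. The text does not state how the error rate is turned into an accuracy, so the form used here (one minus the weighted error count per word, floored at zero) and the per-error weights are assumptions.

```python
def grammar_accuracy(error_weights: list[float], total_words: int) -> float:
    """Assumed form: 1 - (sum of error weights / total words), floored at 0."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    weighted_errors = sum(error_weights)
    return max(0.0, 1.0 - weighted_errors / total_words)


# Hypothetical dialogue: 20 user words, one error weighted 1.0 and one weighted 0.5.
score = grammar_accuracy([1.0, 0.5], total_words=20)
print(round(score, 3))  # 0.925
```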

The third measure is vocabulary, including vocabulary level and diversity. Vocabulary
level has five categories, from primary school level to university level. Vocabulary
level estimates the user's word level from the word distribution: the number of user
words in the $i$th category ($n_{uw,i}$) is divided by the total number of user words
($n_{uw}$), and the result is compared with the vocabulary level of the scenario, which
provides correct references from native speakers. Each category has a different weight
$w_i$. The vocabulary level is set to one if the resulting ratio is higher than one.
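
One possible reading of this measure is sketched below. It assumes that the level is a weighted sum of the per-category word proportions and that the user value is divided by the scenario value. The weights and the category labels in the example are hypothetical.

```python
from collections import Counter

# Hypothetical weights for the five categories (1 = primary school ... 5 = university).
CATEGORY_WEIGHTS = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0, 5: 5.0}


def weighted_level(word_categories: list[int]) -> float:
    """Sum over categories of weight * (words in category / all words)."""
    if not word_categories:
        return 0.0
    counts = Counter(word_categories)
    total = len(word_categories)
    return sum(CATEGORY_WEIGHTS[c] * n / total for c, n in counts.items())


def vocabulary_level(user_cats: list[int], scenario_cats: list[int]) -> float:
    """User level relative to the scenario level, capped at one."""
    scenario = weighted_level(scenario_cats)
    if scenario == 0:
        raise ValueError("scenario must contain at least one word")
    return min(1.0, weighted_level(user_cats) / scenario)


# Hypothetical category label for each word token.
user = [1, 1, 1, 2]          # weighted level 1.25
scenario = [1, 1, 2, 3, 3]   # weighted level 2.0
print(round(vocabulary_level(user, scenario), 3))  # 0.625
```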

Vocabulary diversity is the ratio of the number of word types to the number of tokens
in the user utterances (Attali & Burstein, 2004). Unlike in essay scoring, however,
the ratio should be relative to the vocabulary diversity of the
scenario. For example, in the user utterances “Movies interest me a lot” and “I’m
interest in Action Movies a lot”, there are nine word types (“movies, interest, me, a,
lot, I, am, in, action”) and 13 tokens, so the diversity is 0.69 (9/13). As with
vocabulary level, this value is divided by the vocabulary diversity of the scenario,
and the result is set to one if it is higher than one.
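
The type-token computation in this example can be reproduced with a short sketch. The tokeniser is a deliberately minimal assumption: it lowercases the text, expands only the contraction "I'm" (as the example does), and keeps runs of letters. The function names are illustrative.

```python
import re


def tokenize(utterance: str) -> list[str]:
    """Lowercase, expand "I'm" to "i am" (as in the text), and keep letter runs."""
    text = utterance.lower().replace("\u2019", "'")
    text = text.replace("i'm", "i am")
    return re.findall(r"[a-z]+", text)


def type_token_ratio(utterances: list[str]) -> float:
    """Number of distinct word types divided by the number of tokens."""
    tokens = [tok for utt in utterances for tok in tokenize(utt)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def relative_diversity(user: list[str], scenario: list[str]) -> float:
    """User type-token ratio divided by the scenario's, capped at one."""
    reference = type_token_ratio(scenario)
    if reference == 0:
        raise ValueError("scenario must contain at least one token")
    return min(1.0, type_token_ratio(user) / reference)


user_utterances = ["Movies interest me a lot", "I'm interest in Action Movies a lot"]
print(round(type_token_ratio(user_utterances), 2))  # 0.69
```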

The system provides synonyms and similar expressions to improve user vocabulary
if the same word is used several times in the user utterances. For the example above,
GenieTutor will suggest ‘See also these similar expressions: interest > fascinate,
attract, entertain’.

Syntactic ability includes utterance length and syntactic complexity. The former relates
to the lengths of the utterances, while the latter considers their syntactic structure.
Utterance length compares the lengths of the user sentences with the lengths
of the references in the scenario, and syntactic complexity gives relative complexity
scores by considering the length of the utterances and the number of conjunctions.
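
The two ingredients named here, utterance length relative to the reference and the number of conjunctions, can be sketched as follows. The conjunction list, the whitespace tokenisation, and the cap at one are assumptions. The text does not say how length and conjunction count are combined into a complexity score, so the sketch computes them separately.

```python
# Hypothetical conjunction list; the source does not give one.
CONJUNCTIONS = {"and", "but", "or", "because", "so", "although", "when", "if"}


def mean_length(utterances: list[str]) -> float:
    """Mean number of whitespace-separated tokens per utterance."""
    if not utterances:
        return 0.0
    return sum(len(u.split()) for u in utterances) / len(utterances)


def conjunction_count(utterances: list[str]) -> int:
    """Count tokens that appear in the assumed conjunction list."""
    return sum(
        1
        for u in utterances
        for tok in u.lower().split()
        if tok.strip(".,!?") in CONJUNCTIONS
    )


def relative_length(user: list[str], reference: list[str]) -> float:
    """Mean user length over mean reference length, capped at one (assumed cap)."""
    ref = mean_length(reference)
    if ref == 0:
        raise ValueError("reference must contain at least one token")
    return min(1.0, mean_length(user) / ref)


user = ["I like Action Movies"]
reference = ["I'm interested in Action Movies a lot"]
print(round(relative_length(user, reference), 2))  # 0.57
print(conjunction_count(["I like movies because they are fun, and I watch them often"]))  # 2
```

Under these assumptions a short but correct answer scores lower than the longer reference.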

The dialogue score is the weighted average of all the above measures. The system
provides overall feedback according to the dialogue score (Figure 1).
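
A minimal sketch of the weighted average is shown below. The measure names, their values, and the equal weights are hypothetical, since the text does not report the weights used in GenieTutor.

```python
def dialogue_score(measures: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of the measure scores; weights need not sum to one."""
    if measures.keys() != weights.keys():
        raise ValueError("measures and weights must have the same keys")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(measures[k] * weights[k] for k in measures) / total_weight


# Hypothetical measure values in [0, 1] and equal weights.
measures = {"task": 0.7, "grammar": 0.9, "vocabulary": 0.8, "syntactic": 0.6}
weights = {"task": 1.0, "grammar": 1.0, "vocabulary": 1.0, "syntactic": 1.0}
print(round(dialogue_score(measures, weights), 2))  # 0.75
```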

3. Survey and discussion

A survey on the GenieTutor overall feedback was performed with 30 human
evaluators: 14 elementary English learners and 16 intermediate learners.
They were asked five questions. The first four were rated on a scale from one
(‘Strongly disagree’) to five (‘Strongly agree’). For the final question, the
evaluators needed to pick at least one item.


Table 1. User evaluation on the overall feedback generated from dialogue scoring

Idx  Question                                                                   Score
1    Do you think overall feedback would be helpful to improve your             3.53
     conversation level?
2    Do you think overall feedback would be helpful to motivate your learning?  3.67
3    Do you think the evaluation items of the overall feedback are proper?      3.97
4    Do you think the overall score and the final feedback are proper?          3.50
5    Which evaluation item(s) do you think unnecessary among the overall
     feedback? You can pick one or more from the following items:               (votes)
       A. Overall score                                                         4
       B. Task proficiency                                                      8
       C. Grammar                                                               1
       D. Vocabulary                                                            1
       E. Syntactic                                                             17


From Table 1, we can see that the human evaluators tended to find the overall
feedback helpful to their English learning (average scores of 3.53 and 3.67
for the first and second questions). The users considered the evaluation items of
the overall feedback appropriate (average score of 3.97 for
the third question). However, the scoring of the items was rated only moderately
appropriate (average score of 3.50 for the fourth question), implying that the scoring
approach still needs to be fine-tuned to reflect the learner’s performance more
accurately.

Interestingly, more than half of the evaluators think the Syntactic measure
is less necessary (17 votes out of 30 evaluators), while the Vocabulary and Grammar
measures get only one vote each. This indicates that dialogue scoring should
differ from essay scoring, considering that the syntactic measure is one
of the most important measures in essay scoring. The dialogue in GenieTutor is
restricted and predictable, which is very different from an essay. For example,
the learner has already learned the dialogue “What kind of movies do you like? -> I
like Action Movies” from the given class (scenario). However, when the learner
utters the same sentences in practice, the syntactic complexity measure
gives a lower score to the user utterance and suggests practising a longer
expression: “I’m interested in Action Movies a lot”. Task proficiency gets eight
votes, mostly because of speech recognition problems: the learners complain that
when speech is not recognised correctly, the user utterance fails the semantic
check, which reduces the turn pass ratio and affects the accuracy of the task
proficiency score. This means that speech recognition performance could negatively
influence the participants’ views.


4. Conclusion 

This paper investigated measures for automatic dialogue scoring and
performed a user evaluation of the overall feedback. The results showed that the
overall feedback after a dialogue tended to be helpful to the language learner, even
though turn-by-turn feedback was already provided for semantic and grammar
error correction. The user evaluation also showed that dialogue scoring
for a DB-CALL system should differ from automatic essay scoring in some
measures. In our case, the syntactic ability measure was considered less helpful
than the others, while the grammar and vocabulary measures were considered necessary
along with the overall score.